Contents

    4.2.5  Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  97
4.3  CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs . . . . .  98
    4.3.1  Child-Parent Model for Network Binarization . . . . . . . . . . . . 100
    4.3.2  Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
    4.3.3  Search Strategy for CP-NAS . . . . . . . . . . . . . . . . . . . . . 103
    4.3.4  Optimization of the 1-Bit CNNs . . . . . . . . . . . . . . . . . . . 103
    4.3.5  Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.4  DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-Bit
     CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
    4.4.1  Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
    4.4.2  Redefine Child-Parent Framework for Network Binarization . . . . .  107
    4.4.3  Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
    4.4.4  Tangent Propagation for DCP-NAS . . . . . . . . . . . . . . . . . . 109
    4.4.5  Generalized Gauss-Newton Matrix (GGN) for Hessian Matrix . . . . .  110
    4.4.6  Decoupled Optimization for Training the DCP-NAS . . . . . . . . . . 111
    4.4.7  Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5  Applications in Natural Language Processing                                  118
5.1  Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
    5.1.1  Quantization-Aware Training (QAT) for Low-Bit Large Language
           Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  118
    5.1.2  Post-Training Quantization (PTQ) for Low-Bit Large Language
           Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  118
    5.1.3  Binary BERT Pre-Trained Models . . . . . . . . . . . . . . . . . .  119
5.2  Fully Quantized Transformer for Machine Translation . . . . . . . . . . . 121
    5.2.1  Quantization Scheme . . . . . . . . . . . . . . . . . . . . . . . . 121
    5.2.2  What to Quantize . . . . . . . . . . . . . . . . . . . . . . . . . . 122
    5.2.3  Tensor Bucketing . . . . . . . . . . . . . . . . . . . . . . . . . . 123
    5.2.4  Dealing with Zeros . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3  Q-BERT: Hessian-Based Ultra Low-Precision Quantization of BERT . . . . .  125
    5.3.1  Hessian-Based Mix-Precision . . . . . . . . . . . . . . . . . . . . 125
    5.3.2  Group-Wise Quantization . . . . . . . . . . . . . . . . . . . . . . 125
5.4  I-BERT: Integer-Only BERT Quantization . . . . . . . . . . . . . . . . .  127
    5.4.1  Integer-Only Computation of GELU and Softmax . . . . . . . . . . .  128
    5.4.2  Integer-Only Computation of LayerNorm . . . . . . . . . . . . . . . 128
5.5  Toward Efficient Post-Training Quantization of Pre-Trained Language
     Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
    5.5.1  Module-Wise Reconstruction Error Minimization . . . . . . . . . . . 129
    5.5.2  Model Parallel Strategy . . . . . . . . . . . . . . . . . . . . . . 130
    5.5.3  Annealed Teacher Forcing . . . . . . . . . . . . . . . . . . . . . . 130
5.6  Outlier Suppression: Pushing the Limit of Low-Bit Transformer Language
     Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
    5.6.1  Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
    5.6.2  Gamma Migration . . . . . . . . . . . . . . . . . . . . . . . . . . 133
    5.6.3  Token-Wise Clipping . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7  BinaryBERT: Pushing the Limit of BERT Quantization . . . . . . . . . . .  134
    5.7.1  Ternary Weight Splitting . . . . . . . . . . . . . . . . . . . . . . 136
    5.7.2  Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . . . 136
5.8  BEBERT: Efficient and Robust Binary Ensemble BERT . . . . . . . . . . . . 138
5.9  BiBERT: Accurate Fully Binarized BERT . . . . . . . . . . . . . . . . . . 139
    5.9.1  Bi-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139